Streaming Large Data to/from Pickle Files in Python

Yes, you can process pickle files in a streaming fashion to handle very large datasets without loading everything into memory at once. Here's how to do it:

Streaming Write to Pickle File

import pickle

def stream_write_to_pickle(data_iterable, filename):
    """Write data to pickle file in streaming fashion"""
    with open(filename, 'wb') as f:
        for item in data_iterable:
            pickle.dump(item, f)
            # Drop the reference so very large items can be freed before the next one arrives
            del item

# Example usage:
def generate_large_data():
    """Generator that yields dictionaries one by one"""
    for i in range(1000000):  # 1 million items
        yield {'id': i, 'data': 'x' * 100}  # Example large dictionary

# Write in streaming fashion
stream_write_to_pickle(generate_large_data(), 'large_data.pkl')

Streaming Read from Pickle File

import pickle

def stream_read_from_pickle(filename):
    """Generator that reads pickle file item by item"""
    with open(filename, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                break

# Example usage:
# Process one item at a time without loading all into memory
for item in stream_read_from_pickle('large_data.pkl'):
    # Process your item here
    print(f"Processing item {item['id']}")
    # Item will be garbage collected after processing

Advanced Version with Batch Processing

For better performance with very large files, you can process in batches:

def batch_read_from_pickle(filename, batch_size=1000):
    """Read pickle file in batches"""
    batch = []
    with open(filename, 'rb') as f:
        while True:
            try:
                batch.append(pickle.load(f))
                if len(batch) >= batch_size:
                    yield batch
                    batch = []
            except EOFError:
                if batch:  # yield remaining items
                    yield batch
                break

# Example usage:
for batch in batch_read_from_pickle('large_data.pkl', batch_size=500):
    print(f"Processing batch of {len(batch)} items")
    # Process the batch
    del batch  # free memory explicitly

Important Notes

  1. File Structure: This creates a pickle file containing multiple pickled objects in sequence (not a single list); the short sketch after this list shows the difference.

  2. Performance: Streaming is slower than bulk operations but uses much less memory.

  3. Compatibility: All items must be pickle-able (most Python objects are).

  4. Error Handling: The EOFError catch is how we detect the end of file.

  5. Memory Management: Explicit del statements help with memory management for very large items.
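
To make Note 1 concrete, here is a minimal sketch of the difference between the two layouts (the file names are just illustrative): a file holding one pickled list gives everything back in a single pickle.load call, while a file written with repeated pickle.dump calls gives back one object per load.

import pickle

# Layout A: one pickled list -- a single load returns the whole list at once
with open('as_one_list.pkl', 'wb') as f:
    pickle.dump([{'id': 0}, {'id': 1}], f)
with open('as_one_list.pkl', 'rb') as f:
    print(pickle.load(f))   # [{'id': 0}, {'id': 1}]

# Layout B: objects dumped in sequence -- each load returns only the next object
with open('as_stream.pkl', 'wb') as f:
    pickle.dump({'id': 0}, f)
    pickle.dump({'id': 1}, f)
with open('as_stream.pkl', 'rb') as f:
    print(pickle.load(f))   # {'id': 0}
    print(pickle.load(f))   # {'id': 1}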

This approach lets you handle datasets much larger than your available RAM since you only keep one item (or a small batch) in memory at a time.
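
As a rough, self-contained sketch of that workflow, the code below chains a streaming read to a streaming write so a file can be transformed without ever holding more than one item in memory. The names stream_transform, transform, and the output filename are illustrative, not part of the recipes above.

import pickle

def transform(item):
    """Placeholder for whatever per-item processing you need."""
    item['data'] = item['data'].upper()
    return item

def stream_transform(in_filename, out_filename):
    """Read one pickled item at a time, process it, and write it back out."""
    with open(in_filename, 'rb') as src, open(out_filename, 'wb') as dst:
        while True:
            try:
                item = pickle.load(src)   # only this one item is in memory
            except EOFError:
                break
            pickle.dump(transform(item), dst)

# Example: rewrite 'large_data.pkl' (written above) item by item
stream_transform('large_data.pkl', 'large_data_transformed.pkl')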

Batch Processing with Pickle (5 Dictionaries at a Time)

Here's how to write and read a list of dictionaries in batches of 5 using Python's pickle module:

Writing in Batches (5 at a time)

import pickle

def write_batches_to_pickle(data_list, filename, batch_size=5):
    """
    Write list of dictionaries to pickle file in batches
    :param data_list: Your list of dictionaries
    :param filename: Output pickle file
    :param batch_size: Number of items per batch (default 5)
    """
    with open(filename, 'wb') as f:
        for i in range(0, len(data_list), batch_size):
            batch = data_list[i:i + batch_size]
            pickle.dump(batch, f)

    # Clear the original list if needed
    data_list.clear()

# Example usage:
large_list = [{'id': x, 'data': f'sample{x}'} for x in range(100000)]  # 100k items
write_batches_to_pickle(large_list, 'batched_data.pkl')

Reading in Batches (5 at a time)

def read_batches_from_pickle(filename):
    """
    Generator that yields batches of dictionaries from pickle file
    :param filename: Pickle file containing batched data
    :yields: Lists of dictionaries (batches of 5)
    """
    with open(filename, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                break

# Example usage:
for batch in read_batches_from_pickle('batched_data.pkl'):
    print(f"Processing batch of {len(batch)} items:")
    for item in batch:
        print(f" - Item ID: {item['id']}")
    # Batch will be garbage collected after processing

Complete Workflow Example

# Create sample data (100k dictionaries)
data = [{'id': i, 'value': i*2} for i in range(100000)]

# Write in batches of 5
write_batches_to_pickle(data, 'big_data.pkl')

# Verify original list is cleared
print(f"Original list length after writing: {len(data)}") # Should be 0

# Process data in batches later
total_items = 0
for batch in read_batches_from_pickle('big_data.pkl'):
    print(f"Processing batch (size: {len(batch)})")
    total_items += len(batch)
    # Your processing code here

print(f"Total items processed: {total_items}") # Should be 100000

Key Advantages

  1. Memory Efficiency: Only one batch (at most 5 dictionaries here) is held in memory at a time during processing
  2. Simple Implementation: Uses only the standard library (pickle)
  3. Flexible Batch Size: Easily adjustable via the batch_size parameter
  4. Preserved Structure: Items come back in their original order, so the flat list can be rebuilt from the batches (see the sketch below)
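
If the batch boundaries themselves don't matter downstream, the batches can be flattened back into a stream of individual dictionaries with itertools; this is a small sketch built on the read_batches_from_pickle generator above, and the list() step assumes the restored data fits in memory.

from itertools import chain

# Iterate over individual dictionaries, ignoring batch boundaries
for item in chain.from_iterable(read_batches_from_pickle('batched_data.pkl')):
    pass  # process each dictionary here

# Or, if it fits in memory, rebuild the original flat list
restored = list(chain.from_iterable(read_batches_from_pickle('batched_data.pkl')))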

This approach gives you fine-grained control over memory usage while keeping the implementation simple and dependency-free.